Data-Driven Customer Segmentation and Trend Analysis

Author

Zenan Dong

1 Introduction

1.1 Project Overview

This project aims to analyze customer transaction data from an online retail store, using clustering techniques to segment customers and time-series analysis to understand sales trends. The insights derived will help businesses:

  • Identify high-value customers who contribute the most revenue.

  • Distinguish occasional shoppers from loyal customers.

  • Optimize marketing strategies based on data-driven segmentation.

1.2 Business Value

Effective customer segmentation and sales forecasting can:

  • Improve marketing ROI by targeting the right customer groups.

  • Enhance customer retention by identifying and engaging high-value customers.

  • Optimize inventory planning by understanding sales trends over time.

# A tibble: 6 × 8
  Invoice StockCode Description Quantity InvoiceDate         Price `Customer ID`
  <chr>   <chr>     <chr>          <dbl> <dttm>              <dbl>         <dbl>
1 536365  85123A    WHITE HANG…        6 2010-12-01 08:26:00  2.55         17850
2 536365  71053     WHITE META…        6 2010-12-01 08:26:00  3.39         17850
3 536365  84406B    CREAM CUPI…        8 2010-12-01 08:26:00  2.75         17850
4 536365  84029G    KNITTED UN…        6 2010-12-01 08:26:00  3.39         17850
5 536365  84029E    RED WOOLLY…        6 2010-12-01 08:26:00  3.39         17850
6 536365  22752     SET 7 BABU…        2 2010-12-01 08:26:00  7.65         17850
# ℹ 1 more variable: Country <chr>

2 Data Preprocessing

2.1 Dataset Overview

The dataset includes the following transaction details:

  • Invoice Number – Unique identifier for each transaction.

  • StockCode & Description – Product ID and product name.

  • Quantity & Price – Number of items purchased and price per unit.

  • InvoiceDate – Timestamp of the transaction.

  • Customer ID – Unique identifier for each customer.

  • Country – The country of purchase.

2.2 Data Cleaning

To ensure the quality and reliability of our analysis, we perform the following preprocessing steps:

Remove missing Customer IDs

  • Transactions without customer IDs are discarded as they cannot be associated with specific customer behavior.

Filter out negative values

  • Transactions with negative quantities or prices (likely due to refunds or errors) are removed.

Calculate total spending per transaction

\[ Total Spend=Quantity×Price \]

Aggregate customer purchase behavior

  • Total Revenue per Customer: Sum of all purchases per customer.

  • Purchase Frequency: Number of transactions per customer.

  • Recency: Days since the last purchase.

2.3 Data Cleaning Steps

  • Removed missing Customer IDs: Transactions without customer IDs were discarded.

  • Filtered out negative values: Transactions with negative quantities or prices (likely returns) were removed.

  • Computed total spending per transaction: Total_Spend = Quantity × Price.

  • Aggregated customer data:

    • Total Revenue per Customer: Sum of all purchases.

    • Purchase Frequency: Number of transactions per customer.

    • Recency: Days since the last purchase.

# A tibble: 6 × 4
  `Customer ID` Recency Frequency Monetary
          <dbl>   <dbl>     <int>    <dbl>
1         12346  -5157.         1   77184.
2         12347  -4833.       182    4310 
3         12348  -4906.        31    1797.
4         12349  -4850.        73    1758.
5         12350  -5141.        17     334.
6         12352  -4867.        85    2506.

💡 Insight:
These RFM (Recency, Frequency, Monetary) features provide a structured way to differentiate customer groups based on their spending behavior.

3 Customer Segmentation Using Clustering

3.1 Step 1: Standardizing the Data

Since Recency, Frequency, and Monetary values have different scales, we normalize the dataset to ensure fair clustering.

3.2 Step 2: Applying Clustering Algorithms

We explore three clustering techniques to segment customers effectively:

3.2.1 1️⃣ K-Means Clustering

  • Groups customers into 3 segments based on RFM scores.

  • Assigns each customer to the closest cluster centroid.

Findings:

  • Cluster 1: High-value, frequent buyers.

  • Cluster 2: Occasional buyers with medium spending.

  • Cluster 3: Low-frequency, low-spending customers.

3.2.2 2️⃣ DBSCAN (Density-Based Clustering)

  • Detects natural groups of customers based on purchase density.

  • Identifies outliers (noise customers) who don’t belong to any cluster.

Findings:

  • DBSCAN captures natural customer clusters without needing to predefine K.

  • Customers labeled as -1 are outliers, indicating irregular buying patterns.

3.2.3 3️⃣ Gaussian Mixture Model (GMM)

  • Provides a soft clustering approach, meaning customers belong to multiple clusters with probabilities.

  • Suitable for complex distributions where hard assignments (like K-Means) don’t work well.

Findings:

  • GMM avoids hard boundaries and assigns probability scores to each cluster.

  • Useful when customer behavior overlaps between segments.

3.3 Step 3: Evaluating Cluster Performance

To measure clustering effectiveness, we compute silhouette scores:

Findings:

  • A higher silhouette score indicates better-defined clusters.

  • K-Means performs well, but DBSCAN is more effective in identifying noise customers.

4 RFM Heatmap Analysis

The RFM Heatmap visually represents how Recency, Frequency, and Monetary values vary across different customer clusters identified by K-Means clustering.

4.1 Interpretation of the Heatmap

  • The x-axis represents the RFM metrics (Recency, Frequency, and Monetary).

  • The y-axis represents different customer clusters as identified by K-Means.

  • The color gradient (from blue to red) indicates low to high values for each RFM metric within each cluster.

4.2 💡 Key Insights from the Heatmap:

  1. Monetary (M) & Frequency (F) are strongly correlated for high-value customers

    • The cluster with the highest monetary spending (red color) is also the one with high frequency, indicating loyal, high-spending customers.
  2. Recency (R) is higher (blue) for inactive or lost customers

    • If a cluster shows a high recency (blue color), it means those customers haven’t purchased for a long time.

    • These customers might need win-back campaigns (e.g., discounts, re-engagement emails).

  3. Clusters with lower Frequency & Monetary have higher Recency

    • Occasional shoppers tend to have lower transaction counts and spending, while frequent buyers appear in clusters with higher monetary and frequency values.

4.3 Business Application

High-value customers (Low Recency, High Frequency & Monetary):

  • Target with loyalty programs, VIP discounts, and exclusive offers.
    Occasional buyers (Mid Recency, Mid Frequency & Monetary):

  • Encourage repeat purchases with personalized promotions and email campaigns.
    Inactive customers (High Recency, Low Frequency & Monetary):

  • Implement win-back campaigns with discounts & reactivation offers.

5 Customer Retention & Churn Analysis

In addition to customer segmentation, it’s crucial to analyze customer churn, i.e., whether certain customers stop purchasing or if their buying frequency decreases. By tracking customer retention trends, businesses can implement targeted marketing strategies to reduce churn and increase repeat purchases.

5.1 Calculating Customer Churn Rate

5.1.1 Methodology:

  1. Calculate the total number of active customers per month (customers who made at least one purchase in that month).

  2. Identify repeat customers (customers who made purchases in consecutive months).

  3. Compute the churn rate using the formula:

\[ Churn Rate=(1−Repeat Customers / Total Customers​)×100 \]

  • Repeat Customers: Customers who purchased both in the current month and the previous month.

  • Total Customers: The total number of unique customers in that month.

5.2 Business Insights:

💡 Understanding Churn Rate Trends:

  • If churn rate increases over time, it indicates declining customer retention, requiring strategic intervention.

  • If churn rate spikes in specific months, it may be due to seasonal trends (e.g., post-holiday drop in purchases).

  • Reducing churn is more cost-effective than acquiring new customers, making retention strategies crucial.

Business Strategies:

  • Win-back campaigns: Send special offers to customers who haven’t purchased in the last 60-90 days.

  • Customer segmentation-based offers: Identify high-churn-risk customers and provide personalized discounts.

  • Automated email marketing: Trigger email reminders when a customer hasn’t made a purchase for a set period.

6 ROI Analysis of Clustering-Based Marketing

This code calculates the Return on Investment (ROI) for different clustering-based marketing strategies by adjusting the targeted customer percentage and campaign costs per customer.

Key Adjustments:

  • Targeting only 2% of total customers instead of 10%, ensuring a more focused marketing approach.

  • Increasing campaign cost per customer to $50-$60, reflecting more realistic advertising expenses.

[1] "Final Corrected ROI (K-Means): 310.85 %"
[1] "Final Corrected ROI (DBSCAN): 413.57 %"
[1] "Final Corrected ROI (GMM): 348.2 %"

6.1 Interpretation of ROI Results

  • DBSCAN has the highest ROI (413.57%)

    • Why? DBSCAN identifies high-density customer clusters, meaning it focuses on the most engaged and valuable customers.

    • Business Insight: This suggests that targeting customers in DBSCAN clusters yields the best returns, as these customers convert more often and spend more.

    • Marketing Strategy:

      • Personalized loyalty rewards for these high-value customers.

      • Exclusive membership perks to encourage repeat purchases.

  • GMM has a strong ROI (348.2%)

    • Why? GMM allows for probabilistic customer segmentation, meaning customers are assigned weights based on multiple purchasing behaviors.

    • Business Insight: This means GMM works well for identifying customers who may convert at a slightly lower but still profitable rate.

    • Marketing Strategy:

      • Predictive pricing models to maximize profits from high-value segments.

      • Automated personalized recommendations based on their likelihood to purchase.

  • K-Means has the lowest (but still high) ROI (310.85%)

    • Why? K-Means provides general customer groupings, but may misclassify certain high-value customers.

    • Business Insight: The ROI is still strong, indicating that K-Means works well for large-scale campaigns but may not be as precise as DBSCAN.

    • Marketing Strategy:

      • Broad-based promotions (e.g., seasonal sales, discount codes).

      • Email campaigns for re-engagement of medium-value customers.

6.2 Key Takeaways

DBSCAN is the best option for maximizing ROI, ideal for targeted high-value customer campaigns.
GMM performs well for predictive modeling, useful for dynamic pricing and customer lifetime value estimation.
K-Means is suitable for large-scale marketing, effective for general promotions and wide audience reach.

6.3 Business Strategy Recommendations

Clustering Method ROI (%) Best Use Case Suggested Marketing Strategy
DBSCAN 413.57% High-value customer retention VIP programs, loyalty rewards, personalized offers
GMM 348.2% Predictive modeling AI-driven pricing, automated recommendations
K-Means 310.85% Broad audience targeting Email marketing, seasonal discounts, large campaigns

6.4 Final Conclusion

The analysis confirms that DBSCAN is the most effective clustering approach for maximizing marketing ROI, making it the preferred method for targeting high-value customers. However, K-Means and GMM still provide solid returns, each being effective for different marketing strategies.

🚀 Final Business Decision:

  • Use DBSCAN for premium customer engagement (e.g., loyalty programs).

  • Use GMM for personalized marketing (e.g., predictive pricing).

  • Use K-Means for large-scale promotions (e.g., email & social media ads).

7 Time-Series Analysis: Monthly Sales Trend

Understanding sales trends over time is essential for business planning.

💡 Insights:

  • Sales fluctuate seasonally, peaking in holiday months.

  • A declining trend may indicate customer churn, requiring targeted marketing efforts.

8 Time-Series Forecasting for Sales Prediction

Understanding historical sales trends is useful, but predicting future sales allows businesses to optimize inventory, marketing budgets, and promotional strategies. Here, we use the ARIMA (AutoRegressive Integrated Moving Average) model to forecast future sales.

8.1 Data Preprocessing

Before running the forecasting model, we:

  1. Aggregate transaction data on a monthly basis by summing total sales.

  2. Ensure chronological order for accurate forecasting.

  3. Convert data into a time-series object (TS) for forecasting.

8.2 Applying ARIMA for Sales Forecasting

ARIMA is a widely used model for time-series forecasting, combining:

  • AutoRegressive (AR): Uses past values to predict future trends.

  • Integrated (I): Differencing to make data stationary.

  • Moving Average (MA): Accounts for past errors in predictions.

8.3 Business Insights:

💡 Why is sales forecasting important?

  • Identify future revenue trends: Determine if sales are expected to increase or decline.

  • Optimize inventory planning: If sales are expected to increase, ensure sufficient stock availability.

  • Adjust marketing strategies: If lower sales are forecasted for upcoming months, businesses can launch promotions to boost revenue.

Actionable Business Strategies:

  • If sales are forecasted to decline, consider launching:

    • Promotional campaigns: Limited-time discounts to stimulate demand.

    • Email marketing: Engage previous customers with personalized offers.

  • If sales are expected to rise, businesses should:

    • Optimize inventory to prevent stock shortages.

    • Increase marketing spend on high-performing products.

9 Business Insights & Recommendations

📌 High-Value Customers (Cluster 1 - KMeans & DBSCAN)

  • Marketing Strategy: Loyalty programs, premium memberships.

  • Expected ROI: High, due to frequent, high-spend behavior.

📌 Occasional Buyers (Cluster 2)

  • Marketing Strategy: Discount coupons, targeted ads.

  • Expected ROI: Moderate, can increase purchase frequency.

📌 Inactive Customers (Cluster 3)

  • Marketing Strategy: Win-back campaigns, personalized promotions.

  • Expected ROI: Low, but crucial for reducing churn.

10 Conclusion

This project successfully demonstrates customer segmentation and sales trend analysis, providing actionable insights for business strategy.

🚀 Next Steps:

  • Integrate predictive modeling to forecast future sales trends.

  • Deploy results in BI dashboards for real-time customer tracking.

  • Optimize marketing budget allocation based on cluster insights.